feat: detect and extract json from log-lines #1248

Open
wants to merge 22 commits into base: main

Conversation

de-sh
Contributor

@de-sh de-sh commented Mar 18, 2025

Fixes #XXXX.

Description


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features
    • Docker images now include additional resource files for enhanced deployment functionality.
    • Introduced configurable log format definitions to support diverse logging styles.
    • Enhanced log ingestion endpoints by enabling inline log extraction via custom HTTP headers.
    • Added a new constant for log extraction key in the handlers module.
    • Improved structure for representing unknown log sources in JSON output.
  • Refactor
    • Streamlined log source mapping for more consistent and reliable log processing.
    • Simplified syntax for handling log processing in utility functions.


coderabbitai bot commented Mar 18, 2025

Walkthrough

This pull request adds a new directory copy in two Dockerfiles to include the resources directory during the build. A new JSON file defining log formats using regex patterns is introduced. A new module for known schema handling is added, which provides structures and methods for processing logs based on these formats. Additionally, HTTP ingestion functions are updated to extract log values from headers using a new constant, and tests as well as utility functions are refined to improve code clarity and control flow.

Changes

  • Dockerfile.debug, Dockerfile.kafka: Added COPY resources ./resources after COPY src ./src to include the resources directory during image build.
  • resources/formats.json: New JSON file defining multiple log formats with names and associated regex patterns for log parsing.
  • src/event/format/known_schema.rs: New file introducing SchemaDefinition and EventProcessor structs, extraction methods, a global KNOWN_SCHEMA_LIST, and unit tests for structured log processing.
  • src/event/format/mod.rs: Added public module known_schema and refined LogSource conversion by replacing a catch-all case with explicit matching.
  • src/handlers/http/ingest.rs: Introduced an extract_log variable in multiple functions and updated error handling with a new KnownFormat variant, integrating log extraction via the known schema.
  • src/handlers/http/modal/utils/ingest_utils.rs: Simplified the default branch in the flatten_and_push_logs function by condensing the syntax.
  • src/handlers/mod.rs: Added the constant EXTRACT_LOG_KEY with value "x-p-extract-log".
  • src/handlers/http/logstream.rs: Refactored test cases to use direct references for LogSource and applied pattern matching for unknown log source variants.
  • src/migration/stream_metadata_migration.rs: Updated expected output structure in test_v5_v6_unknown_log_source to reflect a more structured format for log_source.

Suggested labels

for next release

Suggested reviewers

  • de-sh

🐇 In the code, we hop and play,
New logs and schemas come our way.
Resources added, a joy to see,
With regex patterns, we dance with glee!
So let's embrace this change, my friend,
For in our code, the fun won't end! 🐇



@de-sh de-sh marked this pull request as ready for review March 18, 2025 16:20
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/handlers/http/ingest.rs (1)

69-73: Introducing extract_log from headers
Extracting the optional extract_log from request headers yields more flexibility to process inline logs. Consider bounding or validating the header string length if logs are large.
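A minimal sketch of the bound that suggestion implies, assuming actix-web's HttpRequest; the header name "x-p-extract-log" comes from this PR, while the helper name and MAX_EXTRACT_LOG_LEN value are illustrative only:

use actix_web::HttpRequest;

// Assumed bound, not from the PR: the header names a JSON key, so anything
// beyond a short string is likely a client mistake.
const MAX_EXTRACT_LOG_LEN: usize = 256;

fn extract_log_field(req: &HttpRequest) -> Option<&str> {
    req.headers()
        .get("x-p-extract-log")
        .and_then(|v| v.to_str().ok())
        // bound the header value instead of trusting arbitrary client input
        .filter(|s| !s.is_empty() && s.len() <= MAX_EXTRACT_LOG_LEN)
}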

src/event/format/known_schema.rs (1)

38-103: SchemaDefinition with check_or_extract
This method systematically checks for existing fields before performing regex extraction. It's efficient, but consider providing more detailed error info for partial matches or missing groups.
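One way to surface that detail is a richer error type. The sketch below uses thiserror with hypothetical variant names; it is not the PR's actual error enum:

use thiserror::Error;

// Hypothetical richer error for check_or_extract-style failures; the PR
// currently collapses all of these into a single Unacceptable error.
#[derive(Debug, Error)]
pub enum ExtractionError {
    #[error("no known pattern matched and field `{0}` was not already present")]
    NoMatch(String),
    #[error("pattern for format `{format}` matched but capture group `{group}` was missing")]
    MissingGroup { format: String, group: String },
    #[error("extract_log header named field `{0}`, which is absent or not a string")]
    MissingLogField(String),
}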

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160dec4 and 231cc0d.

📒 Files selected for processing (9)
  • Dockerfile (1 hunks)
  • Dockerfile.debug (1 hunks)
  • Dockerfile.kafka (1 hunks)
  • resources/formats.json (1 hunks)
  • src/event/format/known_schema.rs (1 hunks)
  • src/event/format/mod.rs (1 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (3 hunks)
  • src/handlers/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
src/event/format/mod.rs (1) (1)
  • new (105:110)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (22)
Dockerfile.kafka (1)

45-45: LGTM - Adding resources directory to Docker image

This change ensures the resources directory is available in the Kafka-enabled Docker image, which is needed for the new JSON extraction feature.

Dockerfile.debug (1)

32-32: LGTM - Adding resources directory to debug Docker image

This change ensures the resources directory is available in the debug Docker image, consistent with changes to other Dockerfiles.

Dockerfile (1)

33-33: LGTM - Adding resources directory to main Docker image

This change ensures the resources directory is available in the main Docker image, consistent with changes to other Dockerfiles.

src/handlers/mod.rs (1)

25-25: New header constant for log extraction feature

This constant defines a header key that will be used to control the JSON extraction functionality from HTTP requests. The naming convention is consistent with other header constants in this file.

You might want to verify all places where this constant is being used:

#!/bin/bash
# Find all usages of the new EXTRACT_LOG_KEY constant
rg "EXTRACT_LOG_KEY" --type rust
src/event/format/mod.rs (1)

42-42: New module for JSON extraction from log lines

This addition exposes the new known_schema module which implements the core functionality for detecting and extracting JSON data from log lines.

Ensure the module implementation includes proper error handling and testing:

#!/bin/bash
# Check if tests are included for the new module
fd "known_schema" --type file | grep test
resources/formats.json (1)

1-554:

❓ Verification inconclusive

Check for potential catastrophic backtracking in regex patterns
All patterns appear well-defined, but indefinite quantifiers such as .* may lead to performance overhead or catastrophic backtracking in certain edge cases. It's recommended to test or verify them under load to ensure performance is acceptable, especially if logs are large or patterns are processed frequently.


🏁 Script executed:

#!/bin/bash
# Searching for suspicious indefinite quantifiers in patterns.
rg -A2 '\.\*' resources/formats.json

Length of output: 28115


Performance Alert: Validate Regex Load Handling in formats.json
The regex patterns include several instances of unbounded quantifiers (e.g., (?<body>.*)). While these are consistently anchored (typically at the end of each pattern), which generally minimizes backtracking risks, their behavior under heavy or unusually large log entries remains uncertain. Please consider stress-testing these patterns with real-world log loads to confirm that they do not incur performance issues or catastrophic backtracking in production.
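A rough harness for that kind of stress test, assuming the regex crate (which matches in linear time by construction, so throughput on very long lines is the practical concern rather than backtracking blowups); the pattern and input below are illustrative, not taken from formats.json:

use regex::Regex;
use std::time::Instant;

fn main() {
    // Illustrative formats.json-style pattern with a trailing unbounded body group.
    let pattern = Regex::new(r"^(?P<timestamp>\S+) (?P<level>\w+) (?P<body>.*)$").unwrap();
    // Synthetic 1 MB log line to exercise the worst case.
    let line = format!("2023-10-10T13:55:36Z INFO {}", "x".repeat(1_000_000));

    let start = Instant::now();
    for _ in 0..1_000 {
        let caps = pattern.captures(&line).expect("pattern should match");
        assert_eq!(caps.name("level").unwrap().as_str(), "INFO");
    }
    println!("1000 captures over a 1 MB line took {:?}", start.elapsed());
}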

src/handlers/http/modal/utils/ingest_utils.rs (3)

26-26: Import of KNOWN_SCHEMA_LIST
No issues with the new import. It consolidates all known schemas in a single place, which is a good approach for maintainability.


38-41: New parameter extract_log
These changes introduce an Option<&str> to accommodate the extraction of raw log data from a specific field. This approach is flexible and aligns well with optional header usage.


72-76: Custom log extraction branch
Calling KNOWN_SCHEMA_LIST.extract_from_inline_log for LogSource::Custom is a solid extension. Any malformed or unknown format can gracefully fall back with the _ arm.

src/handlers/http/ingest.rs (7)

31-33: Importing Unacceptable and new header keys
Bringing in Unacceptable for error handling and EXTRACT_LOG_KEY constants is clear. This decouples the logic for recognized known formats from the rest of the code.


90-90: Invoking flatten_and_push_logs
Passing extract_log ensures custom inline log extraction for the default ingest path. Implementation is consistent with the newly introduced logic.


153-153: No extraction for OTEL paths
These lines pass None to flatten_and_push_logs, skipping inline log extraction. This is correct for OTEL ingestion flows.

Also applies to: 191-191, 231-231


273-277: extract_log usage in post_event
Mirroring the earlier approach for ingest, which keeps the logic consistent across ingestion endpoints.


285-285: Invoking flatten_and_push_logs
Same approach as in ingest, passing the optional extract_log. The code remains consistent.


351-353: Introducing KnownFormat variant
Attaching Unacceptable to PostError as KnownFormat clarifies error handling for unrecognized or invalid logs. This approach fosters clarity in user-facing error messages.


379-379: HTTP status for KnownFormat errors
Mapping KnownFormat to BAD_REQUEST is appropriate, as the log data doesn't match a recognized pattern. Ensure detailed messages guide users to correct their log format.
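For illustration, a hedged sketch of how such a variant can map to 400 while keeping the message visible to clients; the enum and variant names below are placeholders, not the PR's exact PostError:

use actix_web::{http::StatusCode, HttpResponse, ResponseError};
use thiserror::Error;

#[derive(Debug, Error)]
enum PostErrorSketch {
    // carries the extraction error text so the response explains which format failed
    #[error("unacceptable log format: {0}")]
    KnownFormat(String),
}

impl ResponseError for PostErrorSketch {
    fn status_code(&self) -> StatusCode {
        match self {
            PostErrorSketch::KnownFormat(_) => StatusCode::BAD_REQUEST,
        }
    }

    fn error_response(&self) -> HttpResponse {
        // include the error text in the body so clients can correct their logs
        HttpResponse::build(self.status_code()).body(self.to_string())
    }
}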

src/event/format/known_schema.rs (6)

1-17: License header updated
No issues found.


19-33: Use statements and static initialization
Declaring a lazy static KNOWN_SCHEMA_LIST ensures we only parse and compile the known formats once, improving performance. The approach is straightforward and maintainable.


34-36: Custom error Unacceptable
Having a specialized error type clarifies when the log content fails known format checks.


105-125: Data structures for known formats
Using Format and Pattern to store multiple regex definitions per format is flexible. The EventProcessor organizes them well.


127-202: Implementation of EventProcessor::new and extract_from_inline_log
Compiling regex patterns once ensures we avoid performance overhead at runtime. The approach to skipping unknown formats with a warning is practical, though consider if a fallback or partial extraction is desired.


205-474: Comprehensive test suite
The tests cover multiple scenarios, including edge cases like missing fields and partial matches. This broad coverage is commendable, ensuring confidence in the new extraction logic.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 18, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/event/format/known_schema.rs (3)

126-159: Consider handling regex compilation errors more robustly.

Currently, regex patterns that fail to compile are silently ignored with only a log message. This could lead to situations where expected patterns are not applied without a clear indication to the user.

Consider tracking failed patterns and potentially surfacing this information through a method that can report on schema health, such as:

 pub struct EventProcessor {
     /// Map of format names to their corresponding schema definitions
     pub schema_definitions: HashMap<String, SchemaDefinition>,
+    /// Patterns that failed to compile
+    compilation_errors: HashMap<String, Vec<String>>,
 }

 impl EventProcessor {
     /// Parses given formats from JSON text and stores them in-memory
     fn new(json_text: &str) -> Self {
         let mut processor = EventProcessor {
             schema_definitions: HashMap::new(),
+            compilation_errors: HashMap::new(),
         };

         // ... existing code ...

         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
+                    processor.compilation_errors
+                        .entry(format.name.clone())
+                        .or_insert_with(Vec::new)
+                        .push(format!("{}: {}", pattern, err));
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
         }

         processor
     }
+
+    /// Returns information about patterns that failed to compile
+    pub fn get_compilation_errors(&self) -> &HashMap<String, Vec<String>> {
+        &self.compilation_errors
+    }
 }

168-171: Incorrect error description comment.

The documentation comment for the Err(Unacceptable) return value incorrectly states "JSON provided is acceptable for the known format" when it should indicate the opposite.

 /// # Returns
 /// * `Ok` - The original JSON will now contain extracted fields
-/// * `Err(Unacceptable)` - JSON provided is acceptable for the known format  
+/// * `Err(Unacceptable)` - JSON provided is not acceptable for the known format

145-154: Consider verifying if any patterns were successfully compiled.

If all regex patterns for a format fail to compile, you'll end up with a schema that has field mappings but no patterns. This could lead to silent failures where extraction is attempted but can never succeed.

Consider adding a validation step after processing all patterns:

 for format in formats {
+    let mut has_valid_patterns = false;
     for regex in format.regex {
         let schema = processor
             .schema_definitions
             .entry(format.name.clone())
             .or_insert_with(SchemaDefinition::default);

         schema.field_mappings.push(regex.fields.clone());
         // Compile the regex pattern if present
         // NOTE: we only warn if the pattern doesn't compile
         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
+            has_valid_patterns = true;
         }
     }
+    
+    // If no patterns were valid, log a warning
+    if !has_valid_patterns && processor.schema_definitions.contains_key(&format.name) {
+        warn!("Format '{}' has no valid regex patterns, extraction will always fail", format.name);
+    }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 231cc0d and 186a586.

📒 Files selected for processing (3)
  • src/event/format/known_schema.rs (1 hunks)
  • src/event/format/mod.rs (2 hunks)
  • src/handlers/http/ingest.rs (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/event/format/mod.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
src/event/format/mod.rs (1) (1)
  • new (106-111)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (11)
src/handlers/http/ingest.rs (8)

31-33: Imports added for new JSON extraction functionality.

The imports of Unacceptable and EXTRACT_LOG_KEY indicate the integration with the new JSON extraction feature from log lines.


69-73: Well-implemented extraction of optional log value from headers.

This code properly extracts an optional log format identifier from request headers, using the chain of and_then and to_str().ok() for safe conversion.


90-90: Header extraction passed to flattening function.

The extracted log value is now passed to the flatten_and_push_logs function, allowing for JSON extraction from log lines during ingestion.


153-153: Consistent use of None for OTEL endpoints.

You correctly pass None for the extract_log parameter in OTEL-specific handlers, indicating that no JSON extraction should occur for these specialized formats.

Also applies to: 191-191, 231-231


273-277: Consistent implementation of log extraction in post_event.

The same pattern for extracting optional log values from headers is implemented here, maintaining consistency across all ingestion endpoints.


285-285: Header extraction correctly passed to flattening function.

Similar to line 90, the extracted log value is passed to flatten_and_push_logs to enable JSON extraction during processing.


351-352: New error variant for schema validation.

A new KnownFormat error variant has been added to the PostError enum to handle cases where log format extraction fails, with automatic conversion from the Unacceptable error type.


379-379: Appropriate error status code for format validation failures.

The new error variant correctly returns a BAD_REQUEST status code, which is appropriate for client-provided data that doesn't match the expected format.

src/event/format/known_schema.rs (3)

27-32: Good implementation of global lazy-loaded schema definitions.

Using Lazy for the global instance ensures the JSON parsing only happens once at first access, which is efficient. The embedded JSON file through include_str! is a good approach for packaging the format definitions with the application.
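A condensed sketch of that pattern with hypothetical names; the PR embeds resources/formats.json via include_str!, which is stubbed here with an inline literal so the example stands alone:

use once_cell::sync::Lazy;
use regex::Regex;
use std::collections::HashMap;

// Illustrative stand-in for the embedded file; the PR uses
// include_str!("<path to resources/formats.json>") instead.
static FORMATS_JSON: &str =
    r#"[{"name":"demo_log","regex":[{"pattern":"^(?P<level>\\w+): (?P<body>.*)$","fields":["level","body"]}]}]"#;

// One compiled Regex per format name; parsed and compiled exactly once, on first access.
static KNOWN_SCHEMAS: Lazy<HashMap<String, Regex>> = Lazy::new(|| {
    let formats: serde_json::Value =
        serde_json::from_str(FORMATS_JSON).expect("embedded formats must be valid JSON");
    let mut map = HashMap::new();
    for format in formats.as_array().unwrap() {
        let name = format["name"].as_str().unwrap().to_owned();
        let pattern = format["regex"][0]["pattern"].as_str().unwrap();
        map.insert(name, Regex::new(pattern).unwrap());
    }
    map
});

fn main() {
    // first access triggers parsing and compilation, exactly once per process
    assert!(KNOWN_SCHEMAS["demo_log"].is_match("INFO: all good"));
}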


34-36: Simple and focused error type.

The Unacceptable error type provides a clear message that includes the name of the format that failed to match, which will be helpful for debugging and user feedback.


60-102: Well-implemented field extraction with performance optimizations.

The check_or_extract method is well-designed with multiple efficiency considerations:

  1. Early return if fields already exist
  2. Graceful handling of missing log field
  3. Iterative pattern matching with early success return

This approach minimizes unnecessary regex processing which can be performance-intensive.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/event/format/known_schema.rs (1)

211-211: ⚠️ Potential issue

Avoid using unreachable!() for input validation.

Using unreachable!() for validating inputs is risky, as it will cause the application to panic if reached. Since this could be triggered by malformed client data, it should be handled more gracefully.

-           _ => unreachable!("We don't accept events of the form: {json}"),
+           _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
🧹 Nitpick comments (1)
src/handlers/http/ingest.rs (1)

289-297: Structured extraction for post_event handler.

The implementation mirrors the pattern in the ingest function but is structured differently. This section could potentially be refactored to reduce duplication and share code between the handlers.

Consider extracting the common log source handling logic into a separate function to reduce duplication between ingest and post_event:

- match &log_source {
-     LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
-         return Err(PostError::OtelNotSupported)
-     }
-     LogSource::Custom(src) => {
-         KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
-     }
-     _ => {}
- }
+ handle_log_extraction(&mut json, &log_source, extract_log)?;

// Add this helper function elsewhere in the file:
// fn handle_log_extraction(json: &mut Value, log_source: &LogSource, extract_log: Option<&str>) -> Result<(), PostError> {
//     match log_source {
//         LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
//             return Err(PostError::OtelNotSupported)
//         }
//         LogSource::Custom(src) => {
//             KNOWN_SCHEMA_LIST.extract_from_inline_log(json, src, extract_log)?;
//         }
//         _ => {}
//     }
//     Ok(())
// }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 186a586 and a37d368.

📒 Files selected for processing (3)
  • src/event/format/known_schema.rs (1 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/handlers/http/modal/utils/ingest_utils.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/handlers/http/ingest.rs (3)
src/event/format/mod.rs (2) (2)
  • new (106-111)
  • from (70-80)
src/event/format/known_schema.rs (1) (1)
  • new (147-167)
src/metadata.rs (1) (1)
  • new (95-130)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (17)
src/handlers/http/ingest.rs (10)

31-31: New import adds JSON schema extraction functionality.

The import includes Unacceptable error type and KNOWN_SCHEMA_LIST for validating and extracting structured data from logs according to predefined schemas.


33-33: New EXTRACT_LOG_KEY constant integration.

This import refers to a constant that specifies the HTTP header key used to identify the field containing the raw log data for extraction.


52-52: Function signature modification adds mutable JSON handling.

The parameter is now marked as mut json to allow modification of the JSON during log extraction.


69-73: New header extraction logic for log field identification.

This code extracts an optional field name from request headers that identifies which field in the JSON contains the raw log text to be parsed.


81-90: Enhanced log source handling with structured extraction.

The code now handles different log sources, particularly focusing on extracting structured data from custom logs using regex patterns. For OTel formats, it properly returns an error, while for custom sources it attempts to extract fields based on the defined schema.


91-91: Improved log source metadata with extracted fields.

This creates a LogSourceEntry that includes both the log source type and the set of fields that were successfully extracted, which enhances the metadata for the stream.


253-253: Mutable JSON parameter for in-place field extraction.

The JSON parameter is now mutable to allow the extraction process to modify it by adding structured fields parsed from log lines.


284-288: Consistent header extraction logic across handlers.

This implements the same header extraction logic as in the ingest function, ensuring consistent behavior when extracting the log field name.


365-366: New error variant for schema validation failures.

A new KnownFormat error type is added to handle cases where the log doesn't match the expected format according to the predefined schema.


393-393: Proper HTTP status code for format validation errors.

Returns a BAD_REQUEST status code when a log doesn't match the expected format, which is appropriate for client-side errors related to input validation.

src/event/format/known_schema.rs (7)

27-32: Global schema definitions from embedded JSON resource.

The code loads predefined log format definitions from an embedded JSON file and initializes a global singleton using the Lazy pattern, which ensures thread-safe initialization on first access.


34-36: Clear error type for format validation failures.

The Unacceptable error type provides meaningful error messages when logs don't match the expected format, improving diagnostics.


40-51: Resilient regex deserializer ignores invalid patterns.

The deserializer handles malformed regex patterns gracefully by emitting warnings rather than failing completely. This prevents a single bad pattern from breaking the entire schema.
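The idea can be sketched as a serde deserialize_with helper that logs and returns None on a bad pattern; names below are illustrative and may not match the PR exactly:

use regex::Regex;
use serde::{Deserialize, Deserializer};
use tracing::error;

// A pattern that fails to compile becomes None instead of failing the whole formats file.
fn deserialize_regex<'de, D>(deserializer: D) -> Result<Option<Regex>, D::Error>
where
    D: Deserializer<'de>,
{
    let pattern = String::deserialize(deserializer)?;
    Ok(Regex::new(&pattern)
        .inspect_err(|err| error!("Error compiling regex pattern: {err}; Pattern: {pattern}"))
        .ok())
}

#[derive(Deserialize)]
struct PatternSketch {
    #[serde(deserialize_with = "deserialize_regex", default)]
    pattern: Option<Regex>,
    fields: Vec<String>,
}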


64-67: Schema definition structure for pattern collections.

The SchemaDefinition struct provides a clean abstraction for managing multiple regex patterns for a single log format.


83-129: Efficient field extraction with preemptive checking.

The check_or_extract method first checks if all required fields are already present before attempting extraction, which optimizes performance by avoiding unnecessary regex operations.
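The pre-check on its own amounts to something like the following sketch (hypothetical signature, not the PR's exact method):

use serde_json::{Map, Value};
use std::collections::HashSet;

// If an event already carries every field of some known pattern, no regex work
// is needed and those field names can be returned directly.
fn already_structured(
    obj: &Map<String, Value>,
    field_mappings: &[HashSet<String>],
) -> Option<HashSet<String>> {
    field_mappings
        .iter()
        .find(|fields| fields.iter().all(|field| obj.contains_key(field)))
        .cloned()
}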


179-215: Comprehensive JSON extraction with multi-format support.

The extract_from_inline_log method handles both single JSON objects and arrays of objects, providing flexibility for different log formats.


218-497: Comprehensive test suite for extraction functionality.

The tests cover various scenarios including successful extraction, handling of pre-existing fields, failed matches, and error cases, which ensures the robustness of the implementation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
src/event/format/known_schema.rs (2)

147-154: 🛠️ Refactor suggestion

Potential panic during initialization

Using expect when parsing the JSON schema definitions could cause a panic during application startup if the JSON is malformed.

Consider handling JSON parsing errors more gracefully:

- let formats: Vec<Format> =
-     serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+ let formats: Vec<Format> = match serde_json::from_str(json_text) {
+     Ok(formats) => formats,
+     Err(e) => {
+         error!("Failed to parse format definitions: {}", e);
+         Vec::new() // Return empty formats to allow application to start with reduced functionality
+     }
+ };

211-211: ⚠️ Potential issue

Avoid using unreachable!() for input validation

Using unreachable!() for handling non-object/array JSON values is risky, as it will cause the application to panic if reached, which could be triggered by malformed client data.

- _ => unreachable!("We don't accept events of the form: {json}"),
+ _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
🧹 Nitpick comments (1)
src/event/format/known_schema.rs (1)

179-215: Consider more granular error handling for array processing

Currently, if any event in an array fails extraction, the entire operation fails. Consider a more flexible approach that could skip or separately report problematic entries.

 pub fn extract_from_inline_log(
     &self,
     json: &mut Value,
     log_source: &str,
     extract_log: Option<&str>,
-) -> Result<HashSet<String>, Unacceptable> {
+) -> Result<(HashSet<String>, Vec<usize>), Unacceptable> {
     let Some(schema) = self.schema_definitions.get(log_source) else {
         warn!("Unknown log format: {log_source}");
-        return Ok(Default::default());
+        return Ok((Default::default(), vec![]));
     };

     let mut fields = HashSet::new();
+    let mut failed_indices = Vec::new();
     match json {
         Value::Array(list) => {
-            for event in list {
+            for (idx, event) in list.iter_mut().enumerate() {
                 let Value::Object(event) = event else {
+                    failed_indices.push(idx);
                     continue;
                 };
                 if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
                     fields.extend(known_fields);
                 } else {
-                    return Err(Unacceptable(log_source.to_owned()));
+                    failed_indices.push(idx);
                 }
             }
+            if !failed_indices.is_empty() && failed_indices.len() == list.len() {
+                return Err(Unacceptable(format!("All events failed extraction for format {}", log_source)));
+            }
         }
         Value::Object(event) => {
             if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
-                return Ok(known_fields);
+                return Ok((known_fields, vec![]));
             } else {
                 return Err(Unacceptable(log_source.to_owned()));
             }
         }
-        _ => unreachable!("We don't accept events of the form: {json}"),
+        _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
     }

-    Ok(fields)
+    Ok((fields, failed_indices))
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a37d368 and 199c4d4.

📒 Files selected for processing (1)
  • src/event/format/known_schema.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (5)
src/event/format/known_schema.rs (5)

40-51: Good approach for handling invalid regex patterns

The function logs any regex compilation errors but continues execution by returning None. This gracefully handles malformed patterns without disrupting the entire application.


83-128: Robust implementation of field extraction logic

The check_or_extract method follows a clear sequential approach: first checking if fields already exist, then attempting extraction if needed. The error handling is thorough with proper early returns.


185-188: Good error handling for unknown log formats

The code gracefully handles unknown log formats by logging a warning and returning an empty set of fields rather than failing the operation. This allows the application to continue functioning even when log formats aren't recognized.


25-25: Good use of tracing crate for error logging

Using the tracing crate for logging allows for structured logs with metadata, which is better than simple println! or eprintln! statements.


218-497: Comprehensive test coverage

The tests cover a wide range of scenarios including successful extractions, handling of pre-existing fields, error cases, and array processing. This thorough testing increases confidence in the code's reliability.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/event/format/known_schema.rs (2)

185-188: Consider more specific error handling for unknown formats.

When an unknown log format is requested, the code currently returns an empty HashSet after logging a warning. Consider whether returning an error would be more appropriate to provide clearer feedback to callers.

- let Some(schema) = self.schema_definitions.get(log_source) else {
-     warn!("Unknown log format: {log_source}");
-     return Ok(HashSet::new());
- };
+ let Some(schema) = self.schema_definitions.get(log_source) else {
+     warn!("Unknown log format: {log_source}");
+     return Err(Unacceptable(format!("Unknown log format: {}", log_source)));
+ };

197-202: Reconsider all-or-nothing approach for array items.

If one item in a JSON array fails extraction, the entire operation fails. Consider whether it might be better to extract what's possible and report partial success, especially for batch processing scenarios.

You could collect successful extractions and failures separately, then decide how to handle them based on application requirements.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 199c4d4 and 090e491.

📒 Files selected for processing (1)
  • src/event/format/known_schema.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
src/event/format/mod.rs (1) (1)
  • new (106-111)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: coverage
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (7)
src/event/format/known_schema.rs (7)

40-51: Good approach for handling regex compilation errors.

The deserialize_regex function handles regex compilation errors gracefully by logging them and continuing with None instead of causing a panic. This ensures that malformed regex patterns won't crash the application during startup.


147-154: Potential panic during initialization.

The new method uses expect when parsing the JSON schema definitions, which could cause a panic during application startup if the JSON is malformed.

Consider handling JSON parsing errors more gracefully:

- let formats: Vec<Format> =
-     serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+ let formats: Vec<Format> = match serde_json::from_str(json_text) {
+     Ok(formats) => formats,
+     Err(e) => {
+         error!("Failed to parse format definitions: {}", e);
+         Vec::new() // Return empty formats to allow application to start with reduced functionality
+     }
+ };

83-101: Optimized field presence check.

The method efficiently checks if fields are already present in the JSON object before attempting regex extraction. This optimization avoids unnecessary regex processing when data is already structured correctly.


103-125: Well-structured extraction logic.

The extraction logic is well-implemented - it tries each pattern sequentially, extracts fields with named capture groups, and extends the original object with the extracted values. The early return pattern is used effectively to stop processing once a matching pattern is found.
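A sketch of that per-pattern step, with illustrative names: a matching pattern's named capture groups are copied into the event object as string fields.

use regex::Regex;
use serde_json::{Map, Value};

fn extract_with_pattern(pattern: &Regex, obj: &mut Map<String, Value>, raw: &str) -> bool {
    // No match: let the caller move on to the next pattern.
    let Some(captures) = pattern.captures(raw) else {
        return false;
    };
    // Copy each named group that matched into the event object.
    for name in pattern.capture_names().flatten() {
        if let Some(value) = captures.name(name) {
            obj.insert(name.to_owned(), Value::String(value.as_str().to_owned()));
        }
    }
    true
}

// Usage: extract_with_pattern(&Regex::new(r"^(?P<level>\w+): (?P<body>.*)$").unwrap(),
//                             &mut event_object, "INFO: user logged in")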


179-215: Handle all JSON value types properly.

The extract_from_inline_log method has good handling for JSON objects and arrays, but uses unreachable!() for other JSON types which could lead to application panics if unexpected input is received.

Replace the use of unreachable!() with a proper error return:

-           _ => unreachable!("We don't accept events of the form: {json}"),
+           _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),

218-497: Comprehensive test coverage.

The test suite is thorough and covers various scenarios:

  • Basic extraction functionality
  • Pre-existing fields detection
  • Error handling for non-matching logs
  • Missing fields and patterns
  • Object and array processing
  • Edge cases like unknown formats and missing log fields

103-109: Robust handling of missing patterns.

The code gracefully continues to the next pattern when encountering patterns that failed to compile (None value) or don't match the input. This ensures resilience against individual pattern failures.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 19, 2025
@de-sh
Contributor Author

de-sh commented Mar 19, 2025

Failure condition when expected fields are not present

curl --request POST \
  --url http://localhost:8000/api/v1/ingest \
  --header 'authorization: Basic YWRtaW46YWRtaW4=' \
  --header 'content-type: application/json' \
  --header 'x-p-log-source: access_log' \
  --header 'x-p-stream: tester' \
  --data '{"startTime": "20m", "endTime": "now"}'

Success otherwise

curl --request POST \
  --url http://localhost:8000/api/v1/ingest \
  --header 'authorization: Basic YWRtaW46YWRtaW4=' \
  --header 'content-type: application/json' \
  --header 'x-p-log-source: access_log' \
  --header 'x-p-stream: tester' \
  --data '{"timestamp": "", "c_ip": "", "cs_username": "", "cs_method": "", "cs_uri_stem": "", "cs_uri_query": "", "sc_status": "", "body": "", "test": ""}'

Also uses regex to capture from text:

curl --request POST \
  --url http://localhost:8000/api/v1/ingest \
  --header 'authorization: Basic YWRtaW46YWRtaW4=' \
  --header 'content-type: application/json' \
  --header 'x-p-log-source: alb_log' \
  --header 'x-p-stream: tester' \
  --data '[{"log": "example.com:8080 192.168.1.100 server-name john.doe [10/Oct/2023:13:55:36 +0000] \"GET /products/123?sort=price HTTP/1.1\" 200 2048 \"https://example.com/categories\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\" {\"status\":\"success\",\"time\":123}"}]'

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/event/format/known_schema.rs (2)

148-168: 🛠️ Refactor suggestion

Handle JSON parsing errors gracefully in EventProcessor::new.

As noted earlier, the use of expect could cause a panic if the JSON is malformed. This method should be refactored to handle errors gracefully.


180-216: ⚠️ Potential issue

Avoid unreachable! in extract_from_inline_log.

Using unreachable!() for input validation at line 211 is risky, as it will panic if reached. Since this could be triggered by malformed client data, it should be handled more gracefully.

-           _ => unreachable!("We don't accept events of the form: {json}"),
+           _ => return Err(Error::Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
🧹 Nitpick comments (4)
src/handlers/http/ingest.rs (2)

84-94: Simplify the duplicate OTEL log source check.

There are two separate checks for OTEL log sources - one at lines 77-82 and another at lines 84-87. This is redundant and could be simplified.

- let fields = match &log_source {
-     LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
-         return Err(PostError::OtelNotSupported)
-     }
-     LogSource::Custom(src) => {
-         KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?
-     }
-     _ => HashSet::new(),
- };
+ let fields = if let LogSource::Custom(src) = &log_source {
+     KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?
+ } else {
+     HashSet::new()
+ };

292-300: Consider returning extracted fields for all log types.

The current implementation ignores the return value from extract_from_inline_log for Custom log sources, and doesn't attempt extraction for other non-OTEL sources. Consider using the extracted field information for all log types to improve consistency.

 match &log_source {
     LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
         return Err(PostError::OtelNotSupported)
     }
     LogSource::Custom(src) => {
-        KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
+        let fields = KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
+        // Potentially use the extracted fields here
     }
     _ => {}
 }
src/event/format/known_schema.rs (2)

1-17: Copyright date should be 2024, not 2025.

The copyright header includes the year 2025, but the current year is 2024. This should be corrected.

-/* Parseable Server (C) 2022 - 2025 Parseable, Inc.
+/* Parseable Server (C) 2022 - 2024 Parseable, Inc.

87-129: Improve the check_or_extract method's error reporting.

The method returns None on extraction failure but doesn't provide context on why extraction failed. It also returns Some(pattern.fields.clone()) when fields are already present, which might not be intuitive.

Consider refining the return type to provide more context:

- pub fn check_or_extract(
-     &self,
-     obj: &mut Map<String, Value>,
-     extract_log: Option<&str>,
- ) -> Option<HashSet<String>> {
+ pub fn check_or_extract(
+     &self,
+     obj: &mut Map<String, Value>,
+     extract_log: Option<&str>,
+ ) -> Result<HashSet<String>, &'static str> {
      if let Some(pattern) = self
          .patterns
          .iter()
          .find(|pattern| pattern.fields.iter().all(|field| obj.contains_key(field)))
      {
-         return Some(pattern.fields.clone());
+         return Ok(pattern.fields.clone());
      }

-     let event = extract_log
-         .and_then(|field| obj.get(field))
-         .and_then(|s| s.as_str())?;
+     let event = extract_log
+         .and_then(|field| obj.get(field))
+         .and_then(|s| s.as_str())
+         .ok_or("Log field not found or not a string")?;

      for format in self.patterns.iter() {
          let Some(pattern) = format.pattern.as_ref() else {
              continue;
          };
          let Some(captures) = pattern.captures(event) else {
              continue;
          };
          let mut extracted_fields = Map::new();

          // With named capture groups, you can iterate over the field names
          for field_name in format.fields.iter() {
              if let Some(value) = captures.name(field_name) {
                  extracted_fields.insert(
                      field_name.to_owned(),
                      Value::String(value.as_str().to_string()),
                  );
              }
          }

          obj.extend(extracted_fields);

-         return Some(format.fields.clone());
+         return Ok(format.fields.clone());
      }

-     None
+     Err("No matching pattern found")
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 090e491 and 0db29f1.

📒 Files selected for processing (2)
  • src/event/format/known_schema.rs (1 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
🔇 Additional comments (9)
src/handlers/http/ingest.rs (7)

31-31: Good change: Including the known_schema module.

Adding the import for the known_schema module allows the ingest handlers to use the new functionality for detecting and extracting structured data from log lines.


33-33: Adding the EXTRACT_LOG_KEY import for header extraction.

This change adds the capability to extract log format information from request headers, which is needed for the new feature.


72-76: Extracting log format from headers.

This code properly extracts an optional log format identifier from request headers, which will be used to determine how to process the logs.


256-256: Taking mutable JSON input to support field extraction.

The mut modifier for JSON in the function signature is necessary to allow the new field extraction logic to modify the input JSON object.


287-291: Extracting log format from headers in post_event.

This code extracts the optional log format identifier from request headers, maintaining consistency with the changes in the ingest function.


368-369: Adding KnownFormat error variant.

Good addition of a new error variant to handle errors from the known_schema module, allowing proper error propagation and handling.


396-396: Properly mapping KnownFormat errors to BAD_REQUEST.

Setting the appropriate status code for the new error type ensures that clients receive clear feedback when there are issues with the log format.

src/event/format/known_schema.rs (2)

44-55: Good approach for regex deserialization.

The custom deserialization logic correctly handles invalid regex patterns by logging errors rather than causing failures. This makes the application more robust against configuration issues.


218-497: Comprehensive test coverage.

The test suite is thorough and covers many important scenarios:

  • Successful extraction from different log formats
  • Handling already existing fields
  • Non-matching logs
  • Missing log fields
  • Array and object processing

This comprehensive testing will help ensure the reliability of the feature.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 19, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ff8ed2 and b40779c.

📒 Files selected for processing (9)
  • Dockerfile.debug (1 hunks)
  • Dockerfile.kafka (1 hunks)
  • resources/formats.json (1 hunks)
  • src/event/format/known_schema.rs (1 hunks)
  • src/event/format/mod.rs (2 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/logstream.rs (4 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
  • src/handlers/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/handlers/http/modal/utils/ingest_utils.rs
  • src/handlers/mod.rs
  • Dockerfile.debug
  • src/handlers/http/logstream.rs
  • Dockerfile.kafka
  • resources/formats.json
  • src/event/format/known_schema.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/handlers/http/ingest.rs (3)
src/event/format/known_schema.rs (1) (1)
  • new (148-168)
src/event/format/mod.rs (2) (2)
  • new (118-123)
  • from (82-92)
src/metadata.rs (1) (1)
  • new (95-130)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
🔇 Additional comments (9)
src/event/format/mod.rs (2)

42-42: New module addition

Adding the known_schema module is a good approach for handling structured log formats based on regex patterns. This aligns with the PR objective of detecting and extracting JSON from log lines.


89-90: Improved LogSource enum handling

This change enhances the readability and correctness of the LogSource enum conversion by:

  1. Explicitly handling empty strings ("") as LogSource::Json
  2. Creating a dedicated path for custom formats

This is a necessary improvement to support the new custom log format extraction feature.
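Sketched with a hypothetical, abridged enum (the PR's variant set and header values may differ), the conversion looks roughly like this:

#[derive(Debug, PartialEq)]
enum LogSourceSketch {
    Json,
    OtelLogs,
    Custom(String),
}

impl From<&str> for LogSourceSketch {
    fn from(s: &str) -> Self {
        match s {
            // an absent or empty header means plain JSON ingestion
            "" | "json" => LogSourceSketch::Json,
            "otel-logs" => LogSourceSketch::OtelLogs,
            // everything else names a known-format definition, e.g. "haproxy_log"
            custom => LogSourceSketch::Custom(custom.to_owned()),
        }
    }
}

// Usage: assert_eq!(LogSourceSketch::from("haproxy_log"),
//                   LogSourceSketch::Custom("haproxy_log".to_owned()));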

src/handlers/http/ingest.rs (7)

31-31: New import for known schema handling

Importing the known_schema module and its KNOWN_SCHEMA_LIST which contains predefined schema definitions. This is essential for supporting the feature to detect and extract JSON from log-lines.


33-33: Added extract log key import

The EXTRACT_LOG_KEY constant is now imported to extract the log field identifier from HTTP headers, which will be used to specify which part of the log entry should be parsed for JSON extraction.


72-75: Extract log field from header

Good implementation of extracting an optional log field identifier from the request headers. This parameter allows clients to specify which field in the log entry contains JSON data that should be extracted.


84-95: Added JSON extraction for custom log formats

This implementation properly handles different log source types:

  1. Returns an appropriate error for unsupported OpenTelemetry formats
  2. Extracts JSON data from custom log formats using the schema processor
  3. Creates a log source entry with extracted fields

The extracted fields set will later be stored with the stream metadata, enhancing query capabilities.


369-369: Added known format error variant

The addition of the KnownFormat error variant to the PostError enum properly integrates errors from the known_schema module into the HTTP error handling system.


396-396: Error handling for known format errors

The error handling logic correctly maps KnownFormat errors to a BAD_REQUEST status code, which is appropriate for client-side errors related to invalid log format specifications.


52-95:

❓ Verification inconclusive

Verify the end-to-end functionality

The changes implement a feature for detecting and extracting JSON from log lines, but it would be beneficial to verify how it works end-to-end.


🏁 Script executed:

#!/bin/bash

# Check for the implementation of the EXTRACT_LOG_KEY constant
echo "Checking for EXTRACT_LOG_KEY constant definition:"
rg "EXTRACT_LOG_KEY.*=" --glob "src/handlers/mod.rs"

# Look at how EventProcessor is initialized with format definitions
echo -e "\nChecking EventProcessor initialization:"
rg "KNOWN_SCHEMA_LIST.*=.*Lazy" -A 10 --glob "src/event/format/known_schema.rs"

# Check if there's a formats.json file containing regex patterns
echo -e "\nChecking for formats.json file:"
fd "formats.json"

# If found, examine its contents to understand the supported log formats
echo -e "\nExamining format definitions:"
find . -name "formats.json" -exec cat {} \; | head -n 25

Length of output: 2901


Action required: Verify complete end-to-end log ingestion and extraction functionality

The feature for detecting and extracting JSON from inline log lines appears to be integrated as follows:

  • The EXTRACT_LOG_KEY constant is defined in src/handlers/mod.rs as "x-p-extract-log".
  • The inline log extraction via KNOWN_SCHEMA_LIST.extract_from_inline_log is used in the custom log source branch.
  • The format definitions in resources/formats.json are intact and appear to include the necessary regex patterns.

Please ensure that:

  • End-to-end tests cover requests with valid inline JSON logs, validating that extraction works as expected.
  • The error handling for internal stream names and unsupported log sources (e.g., Otel sources) is working correctly.
  • The integration between log ingestion, extraction, and downstream event processing (e.g., via EventProcessor) is fully validated.

@nikhilsinhaparseable
Contributor

@de-sh I see the stream-info response as below -

{
    "created-at": "2025-03-20T10:18:48.324194537+00:00",
    "first-event-at": "2025-03-20T06:18:48.326-04:00",
    "stream_type": "UserDefined",
    "log_source": [
        {
            "log_source_format": "Json",
            "fields": []
        },
        {
            "log_source_format": {
                "Custom": "haproxy_log"
            },
            "fields": [
                "accept_date",
                "pid",
                "tt",
                "frontend_name",
                "beconn",
                "process_name",
                "backend_name",
                "srv_queue",
                "tw",
                "termination_state",
                "client_port",
                "server_name",
                "backend_queue",
                "srv_conn",
                "bytes_read",
                "stopping_timeout",
                "retries",
                "client_ip",
                "logging_host",
                "feconn",
                "timestamp",
                "tc",
                "actconn"
            ]
        },
        {
            "log_source_format": {
                "Custom": "esx_syslog_log"
            },
            "fields": [
                "log_procname",
                "level",
                "log_syslog_tag",
                "opid",
                "body",
                "syslog_pri",
                "log_pid",
                "logger",
                "timestamp"
            ]
        }
    ]
}

There is a difference in the structure of log_source_format between JSON and the other formats.
Can we update it so it reads like "log_source_format": "esx_syslog_log"?
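One way to get that flat shape is a manual Serialize impl, so Custom("esx_syslog_log") renders as a bare string rather than {"Custom": "esx_syslog_log"}. This is a sketch with a hypothetical enum; a matching Deserialize change plus migration updates would be needed to round-trip existing metadata:

use serde::{Serialize, Serializer};

#[derive(Debug)]
enum LogSourceSketch {
    Json,
    OtelLogs,
    Custom(String),
}

impl Serialize for LogSourceSketch {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // every variant serializes as a bare string
        let name = match self {
            LogSourceSketch::Json => "Json",
            LogSourceSketch::OtelLogs => "OtelLogs",
            LogSourceSketch::Custom(name) => name.as_str(),
        };
        serializer.serialize_str(name)
    }
}

// serde_json::to_string(&LogSourceSketch::Custom("esx_syslog_log".into()))
// would then produce "\"esx_syslog_log\"".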

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/event/format/known_schema.rs (1)

153-154: 🛠️ Refactor suggestion

Potential panic during initialization

The new method uses expect when parsing the JSON schema definitions, which could cause a panic during application startup if the JSON is malformed.

Consider handling JSON parsing errors more gracefully:

- let formats: Vec<Format> =
-     serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+ let formats: Vec<Format> = match serde_json::from_str(json_text) {
+     Ok(formats) => formats,
+     Err(e) => {
+         error!("Failed to parse format definitions: {}", e);
+         Vec::new() // Return empty formats to allow application to start with reduced functionality
+     }
+ };
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b40779c and 4b6209b.

📒 Files selected for processing (9)
  • Dockerfile.debug (1 hunks)
  • Dockerfile.kafka (1 hunks)
  • resources/formats.json (1 hunks)
  • src/event/format/known_schema.rs (1 hunks)
  • src/event/format/mod.rs (3 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/logstream.rs (4 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
  • src/handlers/mod.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/handlers/mod.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • Dockerfile.kafka
  • src/handlers/http/modal/utils/ingest_utils.rs
  • Dockerfile.debug
  • src/handlers/http/logstream.rs
  • src/handlers/http/ingest.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
src/event/format/mod.rs (1) (1)
  • new (120-125)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (7)
resources/formats.json (1)

1-554:

❓ Verification inconclusive

Extensive collection of log format regex patterns added

This file provides a comprehensive set of regex patterns for parsing different log formats, which will greatly enhance Parseable's ability to extract structured data from various logging systems.

A few observations:

  1. Each format includes a body field to capture remaining content
  2. Multiple regex patterns per format handle variations of the same log type
  3. The patterns are detailed and account for many edge cases

Since this file is critical for log parsing functionality, let's verify that the regex patterns compile correctly:


🏁 Script executed:

#!/bin/bash
# Check if all regex patterns compile correctly
cd resources
jq -r '.[] | .regex[] | .pattern' formats.json | while read -r pattern; do
  if ! echo "" | grep -P "$pattern" &>/dev/null; then
    echo "Invalid regex pattern: $pattern"
  fi
done

Length of output: 64023


Action Required: Verify Regex Compatibility with Your Parser Engine

The shell script output shows nearly every regex pattern in the file being flagged, but the grep -P check is not a reliable validity test: grep exits non-zero whenever a pattern simply fails to match the empty input piped to it, so a reported failure does not necessarily mean the pattern failed to compile. More importantly, the check does not exercise the regex engine the log parser actually uses. If that engine fully supports the named capture group syntaxes used in these patterns (e.g. (?<name>…) and (?P<name>…)), the reported failures are unlikely to affect production functionality.

Points to confirm:

  • Ensure that the regex engine used in your production log parser fully supports the syntax used in these patterns.
  • Consider adding tests in your actual parser environment to verify pattern matching and data extraction.
  • If your production engine cannot support these constructs, you may need to refactor the regexes to use a compatible syntax.
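A hedged sketch of such a test, written against the regex crate so the patterns are checked by the engine the parser actually uses (the include_str! path is assumed relative to wherever the test lives):

#[cfg(test)]
mod format_pattern_tests {
    use regex::Regex;
    use serde_json::Value;

    #[test]
    fn all_format_patterns_compile() {
        let formats: Value =
            serde_json::from_str(include_str!("../../../resources/formats.json"))
                .expect("formats.json is valid JSON");

        for format in formats.as_array().expect("top level is an array") {
            let name = format["name"].as_str().unwrap_or("<unnamed>");
            for entry in format["regex"].as_array().expect("regex is an array") {
                // Entries without a pattern (fields-only) are skipped here.
                if let Some(pattern) = entry.get("pattern").and_then(Value::as_str) {
                    Regex::new(pattern).unwrap_or_else(|e| {
                        panic!("format `{name}` has an invalid pattern: {e}")
                    });
                }
            }
        }
    }
}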
src/event/format/mod.rs (3)

42-42: New module added for known schema handling

Adding this module enables the application to work with predefined log formats.


78-80: Custom log format support added to LogSource enum

The Custom(String) variant, together with the #[serde(untagged)] attribute, enables deserialization of user-defined log formats, allowing for more flexible log source handling.


91-92: Modified LogSource string conversion

The implementation now explicitly handles empty strings as the JSON format and creates custom format types for other inputs. This is more explicit and clearer than the previous catch-all approach.
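For reference, the conversion described above looks roughly like the sketch below; the variant names for the recognized sources are assumptions, and only the Json and Custom arms are taken from the diff.

impl From<&str> for LogSource {
    fn from(s: &str) -> Self {
        match s {
            // An absent or explicit "json" source keeps the default JSON handling.
            "" | "json" => LogSource::Json,
            "kinesis" => LogSource::Kinesis,
            "otel-logs" => LogSource::OtelLogs,
            "otel-metrics" => LogSource::OtelMetrics,
            "otel-traces" => LogSource::OtelTraces,
            "pmeta" => LogSource::Pmeta,
            // Anything else becomes a named custom format, e.g. "haproxy_log".
            custom => LogSource::Custom(custom.to_owned()),
        }
    }
}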

src/event/format/known_schema.rs (3)

44-55: Good handling of regex compilation errors

The deserialize_regex function appropriately logs compilation errors but allows the application to continue running. This approach is resilient as it won't break the entire application if a single regex pattern is invalid.
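A minimal sketch of a deserializer with that behavior (names and logging details are assumptions):

use regex::Regex;
use serde::{Deserialize, Deserializer};
use tracing::error;

// An invalid pattern is logged and dropped instead of failing deserialization,
// so one bad entry in formats.json does not disable all the others.
fn deserialize_regex<'de, D>(deserializer: D) -> Result<Option<Regex>, D::Error>
where
    D: Deserializer<'de>,
{
    let pattern = String::deserialize(deserializer)?;
    match Regex::new(&pattern) {
        Ok(regex) => Ok(Some(regex)),
        Err(err) => {
            error!("Could not compile pattern `{pattern}`: {err}");
            Ok(None)
        }
    }
}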


87-129: Well-designed field extraction with fallback mechanism

The check_or_extract method intelligently:

  1. First checks if all fields are already present
  2. Only attempts extraction if needed
  3. Tries multiple patterns until one matches

This approach minimizes regex processing overhead and handles partial matches gracefully.
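A free-standing sketch of that control flow (illustrative only; the real method lives on SchemaDefinition and its types differ):

use std::collections::HashSet;

use regex::Regex;
use serde_json::{Map, Value};

/// Each entry pairs a compiled pattern with the field names it is expected to produce.
fn check_or_extract_sketch(
    event: &mut Map<String, Value>,
    patterns: &[(Regex, Vec<String>)],
    extract_log: Option<&str>,
) -> Option<HashSet<String>> {
    // 1. If some pattern's fields are already present, no regex work is needed.
    for (_, fields) in patterns {
        if fields.iter().all(|f| event.contains_key(f)) {
            return Some(fields.iter().cloned().collect());
        }
    }

    // 2. Otherwise a raw log line is required; its field name comes from the
    //    x-p-extract-log header.
    let raw = event.get(extract_log?)?.as_str()?.to_owned();

    // 3. Try patterns in order until one matches, copying named captures into the event.
    for (regex, fields) in patterns {
        if let Some(captures) = regex.captures(&raw) {
            for name in regex.capture_names().flatten() {
                if let Some(value) = captures.name(name) {
                    event.insert(name.to_owned(), Value::String(value.as_str().to_owned()));
                }
            }
            return Some(fields.iter().cloned().collect());
        }
    }

    None
}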


218-497: Comprehensive test coverage

The test module is thorough, covering:

  • Regular expression pattern matching
  • Field extraction logic
  • Handling of already present fields
  • Error cases like no match, missing fields
  • Processing of both single objects and arrays

This level of testing increases confidence in the implementation.
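As an illustration of the kind of coverage described, here is a hedged sketch of one such test; the inline format definition and field names are made up, and only the EventProcessor::new and extract_from_inline_log signatures are taken from the code under review.

#[test]
fn extracts_fields_from_inline_log_line() {
    use serde_json::json;

    // A single made-up format, shaped like the entries in resources/formats.json.
    let formats = r#"[
        {
            "name": "custom_app_log",
            "regex": [
                {
                    "pattern": "^(?P<level>\\w+): (?P<message>.+)$",
                    "fields": ["level", "message"]
                }
            ]
        }
    ]"#;
    let processor = EventProcessor::new(formats);

    // The raw line sits under "log"; the value passed as extract_log names that field.
    let mut event = json!({ "log": "ERROR: disk is full" });
    let fields = processor
        .extract_from_inline_log(&mut event, "custom_app_log", Some("log"))
        .expect("a known format should extract successfully");

    assert!(fields.contains("level"));
    assert!(fields.contains("message"));
}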

Comment on lines +28 to +32
const FORMATS_JSON: &str = include_str!("../../../resources/formats.json");

/// Global instance of EventProcessor containing predefined schema definitions
pub static KNOWN_SCHEMA_LIST: Lazy<EventProcessor> =
Lazy::new(|| EventProcessor::new(FORMATS_JSON));

⚠️ Potential issue

Consider error handling during initialization

The use of a Lazy static ensures thread-safe initialization, which is good. However, the EventProcessor::new method uses expect which could cause a panic if the JSON is malformed, potentially bringing down the entire application during startup.

Consider implementing more graceful error handling during initialization:

pub static KNOWN_SCHEMA_LIST: Lazy<EventProcessor> =
-    Lazy::new(|| EventProcessor::new(FORMATS_JSON));
+    Lazy::new(|| {
+        match EventProcessor::try_new(FORMATS_JSON) {
+            Ok(processor) => processor,
+            Err(e) => {
+                error!("Failed to parse format definitions: {}", e);
+                EventProcessor { schema_definitions: HashMap::new() }
+            }
+        }
+    });

Then update the EventProcessor implementation:

impl EventProcessor {
    fn try_new(json_text: &str) -> Result<Self, serde_json::Error> {
        let mut processor = EventProcessor {
            schema_definitions: HashMap::new(),
        };

        let formats: Vec<Format> = serde_json::from_str(json_text)?;

        for format in formats {
            for regex in format.regex {
                let schema = processor
                    .schema_definitions
                    .entry(format.name.clone())
                    .or_default();

                schema.patterns.push(regex);
            }
        }

        Ok(processor)
    }
}

Comment on lines +180 to +216
pub fn extract_from_inline_log(
&self,
json: &mut Value,
log_source: &str,
extract_log: Option<&str>,
) -> Result<HashSet<String>, Error> {
let Some(schema) = self.schema_definitions.get(log_source) else {
return Err(Error::Unknown(log_source.to_owned()));
};

let mut fields = HashSet::new();
match json {
Value::Array(list) => {
for event in list {
let Value::Object(event) = event else {
continue;
};
if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
fields.extend(known_fields);
} else {
return Err(Error::Unacceptable(log_source.to_owned()));
}
}
}
Value::Object(event) => {
if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
return Ok(known_fields);
} else {
return Err(Error::Unacceptable(log_source.to_owned()));
}
}
_ => unreachable!("We don't accept events of the form: {json}"),
}

Ok(fields)
}
}

⚠️ Potential issue

Robust method for extracting log fields from JSON structures

The extract_from_inline_log method handles both JSON objects and arrays, applies appropriate schema definitions, and collects field names from successful extractions.

However, there's one issue:

The unreachable! macro on line 211 will cause a panic if reached. Since this function processes user input, it should handle unexpected inputs gracefully.

-            _ => unreachable!("We don't accept events of the form: {json}"),
+            _ => return Err(Error::Unacceptable(format!("Unexpected JSON format: {}", log_source))),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub fn extract_from_inline_log(
&self,
json: &mut Value,
log_source: &str,
extract_log: Option<&str>,
) -> Result<HashSet<String>, Error> {
let Some(schema) = self.schema_definitions.get(log_source) else {
return Err(Error::Unknown(log_source.to_owned()));
};
let mut fields = HashSet::new();
match json {
Value::Array(list) => {
for event in list {
let Value::Object(event) = event else {
continue;
};
if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
fields.extend(known_fields);
} else {
return Err(Error::Unacceptable(log_source.to_owned()));
}
}
}
Value::Object(event) => {
if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
return Ok(known_fields);
} else {
return Err(Error::Unacceptable(log_source.to_owned()));
}
}
_ => return Err(Error::Unacceptable(format!("Unexpected JSON format: {}", log_source))),
}
Ok(fields)
}

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b6209b and f756639.

📒 Files selected for processing (1)
  • src/event/format/mod.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
🔇 Additional comments (3)
src/event/format/mod.rs (3)

42-42: LGTM on adding known_schema module

This new module aligns with the PR's objective to detect and extract JSON from log lines, providing a structured approach to handling known log schemas.


91-92: Improvement in specificity

The code now explicitly handles empty strings and "json" separately from other custom formats, which improves clarity. This explicit handling aligns with the PR's objective of distinguishing between different log formats.


85-93:

❓ Verification inconclusive

Consider backwards compatibility implications

Changing the default behavior from treating all unmatched strings as LogSource::Json to treating them as LogSource::Custom(string) might impact existing code. This is a behavioral change that could affect systems that rely on the previous default.


🏁 Script executed:

#!/bin/bash
# Find places where LogSource is used to understand potential impacts
rg "LogSource::" --type rust -A 3 -B 3

# Check if there are any tests that might be relying on this behavior
rg "LogSource::from" --type rust -A 5

Length of output: 16163


Action Required: Verify Backwards Compatibility for LogSource Default Behavior

The change in the From<&str> implementation now produces LogSource::Custom(custom.to_owned()) for any string that isn’t one of the recognized inputs ("kinesis", "otel-logs", "otel-metrics", "otel-traces", "pmeta", or the empty string/"json"). Previously, any unrecognized string might have defaulted to LogSource::Json. Although tests (e.g., in src/handlers/http/logstream.rs and elsewhere) confirm that expected behavior has shifted—using LogSource::Custom for inputs like "teststream" while preserving Json for the empty or "json" cases—it’s important to ensure that:

  • External integrations or client code aren’t inadvertently relying on the old behavior (i.e., defaulting unrecognized strings to Json).
  • Documentation and/or migration guides clearly communicate this behavioral change so that systems expecting a Json default for unrecognized inputs can be reviewed and, if necessary, updated.
  • All integration tests and module usages (as seen in files like src/handlers/http/ingest.rs and src/utils/json/mod.rs) have been verified against this change to avoid unexpected runtime issues.

Please double-check the external dependencies and any integrations that assume an unrecognized log source will resolve to Json rather than a custom variant.

Comment on lines +78 to 80
// Custom Log Sources e.g. "syslog"
#[serde(untagged)]
Custom(String),

💡 Verification agent

🧩 Analysis chain

Check placement of serde attribute

The #[serde(untagged)] attribute is typically used on entire enums, not on specific variants. It instructs serde to try all possible deserialization paths. Placing it on a single variant might not have the intended effect.


🏁 Script executed:

#!/bin/bash
# Check if the untagged attribute is used similarly in other parts of the codebase
rg "#\[serde\(untagged\)\]" -A 2 -B 2

Length of output: 739


Action Required: Move #[serde(untagged)] from variant to enum level

The attribute is currently applied directly on the Custom(String) variant in src/event/format/mod.rs (lines 78–80). Variant-level #[serde(untagged)] is only supported from serde 1.0.181 onwards, so whether this placement works depends on the serde version in use. In other parts of the codebase (e.g., in src/rbac/user.rs and src/alerts/target.rs), the same attribute is applied to the entire enum. Please either confirm that the serde version in use supports variant-level untagged, or move the #[serde(untagged)] attribute to the enum level so that the intended deserialization behavior is guaranteed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
resources/formats.json (1)

1-600: Well-structured collection of log format definitions with comprehensive regex patterns.

This new file provides a robust set of regex patterns for parsing various log formats. The structure is consistent and well-organized, with each format having a name and associated regex patterns with named capture groups.

While the implementation appears solid, consider the following suggestions:

  1. Some regex patterns are quite complex (e.g., line 27 for ALB logs), which might impact performance during log parsing.
  2. Consider adding version information or a last updated timestamp to the file to track changes over time.
  3. Adding a short comment at the beginning of the file explaining its purpose and how it's used would improve maintainability.

Would you like me to suggest some optimizations for the more complex regex patterns or add documentation to explain the file's purpose?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85e13cc and 2364b39.

📒 Files selected for processing (2)
  • resources/formats.json (1 hunks)
  • src/migration/stream_metadata_migration.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (1)
src/migration/stream_metadata_migration.rs (1)

259-259: Update to log source structure in expected test output.

This change reflects the migration from a simple string representation of log source to a structured format with log_source_format and fields properties. This aligns with the new format definitions introduced in the formats.json file.

The test adjustment ensures that the v5_v6 migration properly handles unknown log source types by converting them to the new structured format.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
resources/formats.json (2)

1-22: Inconsistent Capture Group Syntax in "access_log" Definitions
The first regex pattern (lines 5–8) uses the syntax (?<timestamp>...) while the following ones use Python's named group style (?P<name>...). To avoid confusion or parsing issues across regex engines, please consider standardizing the capture group syntax across all entries.
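For what it's worth, Rust's regex crate accepts both spellings from version 1.9 onwards (older versions only accept the (?P<name>…) form), as a quick check along these lines would confirm:

use regex::Regex;

fn main() {
    // Both named-group spellings found in formats.json; with regex >= 1.9 both compile.
    let plain = Regex::new(r"(?<timestamp>\d{4}-\d{2}-\d{2})").unwrap();
    let python_style = Regex::new(r"(?P<timestamp>\d{4}-\d{2}-\d{2})").unwrap();

    for re in [&plain, &python_style] {
        let caps = re.captures("2025-03-20 10:18:48").unwrap();
        assert_eq!(&caps["timestamp"], "2025-03-20");
    }
}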


1-613: Overall Complexity and Maintainability Considerations
This file defines a very extensive list of log formats with intricate regex patterns. While the comprehensive nature of the patterns is commendable for parsing varied log types, please consider the following improvements:

  • Documentation: Add inline comments or maintain an external documentation file summarizing each log format’s purpose and the intended meaning of its capture groups. This will assist future maintainers when updating or debugging these patterns.
  • Modularization: As the file grows, it might be beneficial to split the formats into multiple files (e.g., grouping by log source) or add metadata that categorizes each entry. This could improve readability and ease future enhancements.

Overall, ensure that each regex pattern is well tested against representative log samples, especially given the mix in capture group conventions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2364b39 and fd65c35.

📒 Files selected for processing (3)
  • resources/formats.json (1 hunks)
  • src/handlers/http/ingest.rs (7 hunks)
  • src/handlers/http/modal/utils/ingest_utils.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/handlers/http/modal/utils/ingest_utils.rs
  • src/handlers/http/ingest.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage

Comment on lines +518 to +525
{
"name": "web_robot_log",
"regex": [
{
"fields": ["ip", "timestamp", "method", "resource", "response", "bytes", "referrer", "request", "request-id", "useragent"]
}
]
},

⚠️ Potential issue

Missing Regex Pattern in "web_robot_log" Entry
The object for the "web_robot_log" (lines 519–525) contains only the "fields" array but does not provide a "pattern". Without a regex pattern, the parser will not be able to match any log lines for this format. Please add the appropriate regex pattern or, if intentionally omitted, include a comment explaining the rationale.
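A small guard test along these lines (the include_str! path is assumed) would surface such entries automatically; as written it would fail until web_robot_log either gains a pattern or is explicitly exempted:

#[test]
fn every_format_has_at_least_one_pattern() {
    let formats: serde_json::Value =
        serde_json::from_str(include_str!("../../../resources/formats.json")).unwrap();

    for format in formats.as_array().unwrap() {
        let name = format["name"].as_str().unwrap_or("<unnamed>");
        let has_pattern = format["regex"]
            .as_array()
            .unwrap()
            .iter()
            .any(|entry| entry.get("pattern").is_some());
        assert!(has_pattern, "format `{name}` has no usable regex pattern");
    }
}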
